The Visual Centrifuge: Model-Free Layered Video Representations
True video understanding requires making sense of non-Lambertian scenes, where
the color of light arriving at the camera sensor encodes information not just
about the last object it interacted with, but about multiple media -- colored
windows, dirty mirrors, smoke or rain. Layered video representations have the
potential of accurately modelling realistic scenes but have so far required
stringent assumptions on motion, lighting and shape. Here we propose a
learning-based approach for multi-layered video representation: we introduce
novel uncertainty-capturing 3D convolutional architectures and train them to
separate blended videos. We show that these models then generalize to single
videos, where they exhibit interesting abilities: color constancy, factoring
out shadows and separating reflections. We present quantitative and qualitative
results on real-world videos.
Comment: Appears in: 2019 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR 2019). This arXiv version contains the CVPR camera-ready
version of the paper (although we have included larger figures) as well as an
appendix detailing the model architecture.
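To make the blend-and-separate training signal described above concrete, the sketch below mixes two clips and scores the better of the two possible pairings between predicted layers and ground-truth videos. It is a minimal illustration, not the paper's architecture: the uniform alpha blend, the placeholder `model`, and the plain L1 permutation-invariant loss are all assumptions.

```python
# Minimal sketch of training a network to un-mix blended videos (assumed details).
import torch
import torch.nn.functional as F

def blend(video_a, video_b):
    # videos: (B, C, T, H, W); a uniform alpha blend stands in for the paper's mixing
    alpha = torch.rand(video_a.size(0), 1, 1, 1, 1, device=video_a.device)
    return alpha * video_a + (1 - alpha) * video_b

def permutation_invariant_loss(pred1, pred2, gt_a, gt_b):
    # score both ways of matching predictions to ground truth and keep the cheaper one
    cost_identity = F.l1_loss(pred1, gt_a) + F.l1_loss(pred2, gt_b)
    cost_swapped = F.l1_loss(pred1, gt_b) + F.l1_loss(pred2, gt_a)
    return torch.minimum(cost_identity, cost_swapped)

def training_step(model, video_a, video_b):
    mixed = blend(video_a, video_b)
    pred1, pred2 = model(mixed)          # model: any 3D-conv video network with two outputs
    return permutation_invariant_loss(pred1, pred2, video_a, video_b)
```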
Apprentissage structuré à partir de vidéos et langage (Structured Learning from Videos and Language)
The goal of this thesis is to develop models, representations and structured learning algorithms for the automatic understanding of complex human activities from instructional videos narrated with natural language. We first introduce a model that, given a set of narrated instructional videos describing a task, is able to generate a list of action steps needed to complete the task and locate them in the visual and textual streams. To that end, we formulate two assumptions. First, people perform actions when they mention them. Second, we assume that complex tasks are composed of an ordered sequence of action steps. Equipped with these two hypotheses, our model first clusters the textual inputs and then uses this output to refine the location of the action steps in the video. We evaluate our model on a newly collected dataset of instructional videos depicting 5 different complex goal-oriented tasks. We then present an approach to link actions and the manipulated objects. More precisely, we focus on actions that aim at modifying the state of a specific object, such as pouring coffee into a cup or opening a door. Such actions are an inherent part of instructional videos. Our method is based on the optimization of a joint cost between actions and object states under constraints. The constraints reflect our assumption that there is a consistent temporal order for the changes in object states and manipulation actions. We demonstrate experimentally that object states help localize actions and, conversely, that action localization improves object state recognition. All our models are based on discriminative clustering, a technique that allows us to leverage the readily available weak supervision contained in instructional videos. To deal with the resulting optimization problems, we take advantage of a well-suited optimization technique: the Frank-Wolfe algorithm. Motivated by the fact that scaling our approaches to thousands of videos is essential in the context of narrated instructional videos, we also present several improvements to make the Frank-Wolfe algorithm faster and more computationally efficient. In particular, we propose three main modifications to the Block-Coordinate Frank-Wolfe algorithm: gap-based sampling, away and pairwise Block Frank-Wolfe steps, and a solution to cache the oracle calls. We show the effectiveness of our improvements on four challenging structured prediction tasks.
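As an illustration of the gap-based sampling idea listed among the Frank-Wolfe improvements, here is a minimal sketch of Block-Coordinate Frank-Wolfe over a product of simplices (one block per training video), where blocks are sampled in proportion to their last measured duality gap. The objective, gradient oracle and step-size rule are generic assumptions, not the thesis implementation.

```python
# Sketch of gap-based block sampling for Block-Coordinate Frank-Wolfe (assumed setting).
import numpy as np

def bcfw_gap_sampling(grad_fn, x, n_iters=1000, seed=0):
    # x: list of blocks, each a probability vector living on a simplex
    rng = np.random.default_rng(seed)
    gaps = np.ones(len(x))                           # optimistic initial gap estimates
    for t in range(n_iters):
        if gaps.sum() == 0:                          # every block is locally optimal
            break
        i = rng.choice(len(x), p=gaps / gaps.sum())  # sample a block proportionally to its gap
        g = grad_fn(x, i)                            # gradient with respect to block i
        s = np.zeros_like(x[i])
        s[np.argmin(g)] = 1.0                        # linear minimization oracle on the simplex
        gaps[i] = float(g @ (x[i] - s))              # Frank-Wolfe duality gap for block i
        gamma = 2.0 / (t + 2.0)                      # standard diminishing step size
        x[i] = (1 - gamma) * x[i] + gamma * s
    return x
```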
Learning to Localize and Align Fine-Grained Actions to Sparse Instructions
Automatic generation of textual video descriptions that are time-aligned with
video content is a long-standing goal in computer vision. The task is
challenging due to the difficulty of bridging the semantic gap between the
visual and natural language domains. This paper addresses the task of
automatically generating an alignment between a set of instructions and a first
person video demonstrating an activity. The sparse descriptions and ambiguity
of written instructions create significant alignment challenges. The key to our
approach is the use of egocentric cues to generate a concise set of action
proposals, which are then matched to recipe steps using object recognition and
computational linguistic techniques. We obtain promising results on both the
Extended GTEA Gaze+ dataset and the Bristol Egocentric Object Interactions
Dataset.
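The matching step described above can be pictured as an order-preserving alignment between action proposals and instruction steps. The sketch below uses a toy word-overlap similarity between recognized object labels and step text, plus a simple dynamic program that allows unmatched proposals and steps; the paper's egocentric proposal generation and linguistic processing are not modeled here.

```python
# Sketch of order-preserving alignment between proposals and steps (toy similarity).
def step_similarity(proposal_objects, step_words):
    # fraction of step words that also appear among the recognized objects (assumed measure)
    overlap = len(set(proposal_objects) & set(step_words))
    return overlap / (1 + len(step_words))

def alignment_score(proposals, steps):
    # proposals: list of object-label sets; steps: list of word sets (both in temporal order)
    n, m = len(proposals), len(steps)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = score[i - 1][j - 1] + step_similarity(proposals[i - 1], steps[j - 1])
            skip = max(score[i - 1][j], score[i][j - 1])   # leave a proposal or a step unmatched
            score[i][j] = max(match, skip)
    return score[n][m]
```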
Unsupervised Learning from Narrated Instruction Videos
We address the problem of automatically learning the main steps to complete a
certain task, such as changing a car tire, from a set of narrated instruction
videos. The contributions of this paper are three-fold. First, we develop a new
unsupervised learning approach that takes advantage of the complementary nature
of the input video and the associated narration. The method solves two
clustering problems, one in text and one in video, applied one after the other
and linked by joint constraints to obtain a single coherent sequence of steps
in both modalities. Second, we collect and annotate a new challenging dataset
of real-world instruction videos from the Internet. The dataset contains about
800,000 frames for five different tasks that include complex interactions
between people and objects, and are captured in a variety of indoor and outdoor
settings. Third, we experimentally demonstrate that the proposed method can
automatically discover, in an unsupervised manner, the main steps to achieve
the task and locate the steps in the input videos.
Comment: Appears in: 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR 2016). 21 pages.
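The two assumptions above (steps are performed roughly when they are narrated, and in a single common order) can be written as a small dynamic program over per-frame step scores. In the sketch below, `frame_scores` and `narration_times` are assumed inputs, and the window-plus-ordering search is a simplified stand-in for the paper's joint discriminative clustering, given for illustration only.

```python
# Sketch of ordered step localization near narration times (assumed inputs and scoring).
def locate_steps_score(frame_scores, narration_times, window=60):
    # frame_scores[k][t]: score of frame t for step k; narration_times[k]: frame of the mention
    K, T = len(frame_scores), len(frame_scores[0])
    NEG = float("-inf")
    best = [[NEG] * T for _ in range(K)]
    for k in range(K):
        lo = max(0, narration_times[k] - window)        # "do it when you say it" window
        hi = min(T, narration_times[k] + window)
        for t in range(lo, hi):
            prev = 0.0 if k == 0 else max(best[k - 1][:t], default=NEG)  # enforce step order
            if prev > NEG:
                best[k][t] = prev + frame_scores[k][t]
    return max(best[K - 1])   # best total score; backtracking would recover the frame indices
```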
Cross-task weakly supervised learning from instructional videos
In this paper we investigate learning visual models for the steps of ordinary
tasks using weak supervision via instructional narrations and an ordered list
of steps instead of strong supervision via temporal annotations. At the heart
of our approach is the observation that weakly supervised learning may be
easier if a model shares components while learning different steps: `pour egg'
should be trained jointly with other tasks involving `pour' and `egg'. We
formalize this in a component model for recognizing steps and a weakly
supervised learning framework that can learn this model under temporal
constraints from narration and the list of steps. Existing datasets do not permit
a systematic study of sharing, so we also gather a new dataset, CrossTask,
aimed at assessing cross-task sharing. Our experiments demonstrate that sharing
across tasks improves performance, especially when done at the component level,
and that our component model can parse previously unseen tasks by virtue of its
compositionality.
Comment: 18 pages, 17 figures, to be published in the proceedings of CVPR
2019.
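The component idea above amounts to scoring a step such as `pour egg' by summing scores of shared components (`pour', `egg'), so that parameters learned for `pour' in one task transfer to every other step containing it. The sketch below assumes simple linear per-component scores on a frame feature; the feature dimension and the linear form are illustrative assumptions, not the paper's model.

```python
# Sketch of a component-based step scorer with weights shared across tasks (assumed form).
import numpy as np

class ComponentStepModel:
    def __init__(self, feat_dim, seed=0):
        self.feat_dim = feat_dim
        self.weights = {}                      # one weight vector per component word
        self.rng = np.random.default_rng(seed)

    def _w(self, component):
        if component not in self.weights:      # components are shared across all tasks
            self.weights[component] = 0.01 * self.rng.standard_normal(self.feat_dim)
        return self.weights[component]

    def score_step(self, step_words, frame_feature):
        # step score = sum of per-component linear scores on the frame feature
        return sum(float(self._w(w) @ frame_feature) for w in step_words)

model = ComponentStepModel(feat_dim=512)
feature = np.zeros(512)
print(model.score_step(["pour", "egg"], feature))   # shares "pour" with e.g. ["pour", "milk"]
```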
Controllable Attention for Structured Layered Video Decomposition
The objective of this paper is to be able to separate a video into its
natural layers, and to control which of the separated layers to attend to. For
example, to be able to separate reflections, transparency or object motion. We
make the following three contributions: (i) we introduce a new structured
neural network architecture that explicitly incorporates layers (as spatial
masks) into its design. This improves separation performance over previous
general purpose networks for this task; (ii) we demonstrate that we can augment
the architecture to leverage external cues such as audio for controllability
and to aid disambiguation; and (iii) we experimentally demonstrate the
effectiveness of our approach and training procedure with controlled
experiments while also showing that the proposed model can be successfully
applied to real-world applications such as reflection removal and action
recognition in cluttered scenes.
Comment: In ICCV 2019.
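One way to picture the mask-based layer structure described in contribution (i) is a head that predicts per-layer spatial masks (softmax-normalized) and per-layer appearances, with the input explained by their masked sum. The backbone, the number of layers and any audio conditioning are left abstract below; this is a hypothetical sketch, not the paper's architecture.

```python
# Sketch of a layered output head: per-layer masks plus per-layer RGB, recomposed by a
# masked sum (assumed design, simplified).
import torch
import torch.nn as nn

class LayeredHead(nn.Module):
    def __init__(self, in_channels, num_layers=2):
        super().__init__()
        self.num_layers = num_layers
        self.mask_head = nn.Conv3d(in_channels, num_layers, kernel_size=1)
        self.rgb_head = nn.Conv3d(in_channels, 3 * num_layers, kernel_size=1)

    def forward(self, features):
        # features: (B, C, T, H, W) from some video backbone
        masks = self.mask_head(features).softmax(dim=1)            # (B, K, T, H, W)
        rgb = self.rgb_head(features)
        B, _, T, H, W = rgb.shape
        layers = rgb.view(B, self.num_layers, 3, T, H, W)          # (B, K, 3, T, H, W)
        recomposed = (masks.unsqueeze(2) * layers).sum(dim=1)      # (B, 3, T, H, W)
        return masks, layers, recomposed
```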
Multi-Task Learning of Object State Changes from Uncurated Videos
We aim to learn to temporally localize object state changes and the
corresponding state-modifying actions by observing people interacting with
objects in long uncurated web videos. We introduce three principal
contributions. First, we explore alternative multi-task network architectures
and identify a model that enables efficient joint learning of multiple object
states and actions such as pouring water and pouring coffee. Second, we design
a multi-task self-supervised learning procedure that exploits different types
of constraints between objects and state-modifying actions enabling end-to-end
training of a model for temporal localization of object states and actions in
videos from only noisy video-level supervision. Third, we report results on the
large-scale ChangeIt and COIN datasets containing tens of thousands of long
(un)curated web videos depicting various interactions such as hole drilling,
cream whisking, or paper plane folding. We show that our multi-task model
achieves a relative improvement of 40% over the prior single-task methods and
significantly outperforms both image-based and video-based zero-shot models for
this problem. We also test our method on long egocentric videos of the
EPIC-KITCHENS and the Ego4D datasets in a zero-shot setup, demonstrating the
robustness of our learned model.
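The causal-ordering constraint described above can be read as: within one video, pick an initial-state frame, an action frame and an end-state frame that occur in that temporal order and jointly maximize the per-frame scores. The sketch below assumes the three score sequences come from the model and uses a simple cumulative-max pass; it illustrates the constraint, not the paper's exact training procedure.

```python
# Sketch of selecting (initial state, action, end state) frames in causal order (assumed scores).
def best_ordered_triple_score(initial_scores, action_scores, end_scores):
    T = len(action_scores)
    NEG = float("-inf")
    best_init = [NEG] * T          # best initial-state score at or before each frame
    running = NEG
    for t in range(T):
        running = max(running, initial_scores[t])
        best_init[t] = running
    best_end = [NEG] * T           # best end-state score at or after each frame
    running = NEG
    for t in range(T - 1, -1, -1):
        running = max(running, end_scores[t])
        best_end[t] = running
    best = NEG                     # action frame flanked by the best compatible state frames
    for t in range(1, T - 1):
        best = max(best, best_init[t - 1] + action_scores[t] + best_end[t + 1])
    return best
```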